ED 141                                                  CE 011 068

AUTHOR          Schwind, Hermann F.
TITLE           New Ways to Evaluate Teaching and Training Effectiveness.
PUB DATE        Apr 77
NOTE            31p.; Paper presented at the Adult Education Research Conference (Minneapolis, Minnesota, April 1977)

EDRS PRICE      MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS     Behavioral Objectives; Behavior Patterns; *Behavior Rating Scales; Behavior Standards; Evaluation Criteria; *Evaluation Methods; Literature Reviews; *Measurement Instruments; Performance Specifications; Personnel Evaluation; *Task Performance; Test Reliability



ABSTRACT

This paper discusses the advantages and disadvantages of commonly used measures of job effectiveness, concentrating on a recent development in the field, the Behaviorally Anchored Rating Scale (BARS); and proposes a new approach, the Behavior Description Index (BDI), which the author contends reduces or avoids most of the shortcomings of other methods. After discussing the advantages and distinguishing features of BARS, the author refers to the main problems of currently used instruments that have been cited in the literature: Low inter-rater reliability, central tendency (inclination of rater to avoid extreme ratings), halo effect (tendency to assign the same rating to each factor being rated), and leniency effect (tendency of supervisors to overrate subordinates). Two shortcomings not dealt with in the literature reviewed are also presented: Waste of valuable information and multidimensionality. The paper then examines the characteristics of the BDI and claims advantages of the new scale over other scales (for example, that the BDI uses behavioral criteria; uses a larger sample of the total job behavior domain than BARS; has less leniency, halo, and central tendency effects; and probably has higher inter-rater reliability). Implications of the use of the new instrument in performance and training evaluation are discussed. References and examples of statements from the BARS are appended. (LMS)



* Documents acquired by ERIC include many informal unpublished
* materials not available from other sources. ERIC makes every effort
* to obtain the best copy available. Nevertheless, items of marginal
* reproducibility are often encountered and this affects the quality
* of the microfiche and hardcopy reproductions ERIC makes available
* via the ERIC Document Reproduction Service (EDRS). EDRS is not
* responsible for the quality of the original document. Reproductions
* supplied by EDRS are the best that can be made from the original.




********************************************** 






NEW WAYS TO EVALUATE
TEACHING AND TRAINING EFFECTIVENESS

by

Hermann F. Schwind
Assistant Professor
Faculty of Commerce
SAINT MARY'S UNIVERSITY
Halifax, Nova Scotia
B3H 3C3



ABSTRACT

The main problems and disadvantages of currently used performance appraisal and training evaluation methods are discussed: leniency, central tendency, and halo effects, low interrater reliability, and improper criteria (low validity). A new approach to performance and training evaluation is the use of critical incidents in Behaviourally Anchored Rating Scales (BARS). It is claimed that BARS have more positive characteristics than other scales, e.g. summated rating scales, graphic rating scales, etc. However, recent research has shown that under certain conditions BARS do not demonstrate superior qualities to other scales. An improved use of critical incidents is described which has all the advantages BARS possess, but avoids the disadvantages. The new instrument is the Behaviour Description Index (BDI), which uses behavioural criteria, utilizes a larger sample of the total job behaviour domain than BARS, has less leniency, halo, and central tendency effects, and probably higher interrater reliability. Implications of the use of the new instrument in performance and training evaluation are discussed.






With the economic slump during the last few years, and the resulting budget cuts for educational institutions and industrial training programmes, came a new interest in the assessment of the quality of instructors and the outcome of training and development programmes in government and industry. While research concerned with the evaluation of teaching effectiveness has been a going concern for decades, the evaluation of training programmes "is much in the same category Mark Twain placed the weather," to use McGehee and Thayer's words (1961, p. 256). Everybody talks about it, but little is done. There are several possible reasons for this, such as

    1. Difficulty in doing controlled studies in organizations;

    2. Costs of training evaluations;

    3. Fear of discovering unwelcome facts on training outcomes;

    4. Unwillingness to accept new approaches, a "don't rock the boat" attitude.¹

A seemingly important reason is not mentioned here: the lack of a valid and reliable evaluation instrument, a fact which seems to be true also for the evaluation of teaching effectiveness.



A look at currently used evaluation forms to measure "teaching ability" illustrates the problem. As Harari and Zedeck (1973) put it: "Current ...evaluation forms are often ambiguous, verbose, disorganized and arbitrarily developed. They consist of global behavioral measures and vague trait descriptions. As a result, the forms tend to be unreliable and very susceptible to response biases."

The situation is worse in the area of training evaluation. The majority of business organizations seem to use simple reaction measurements (How did you like the program?) to assess the outcome of their training programs (Catalano and Kirkpatrick, 1968). It should be obvious that the fact that a participant likes a certain programme says nothing about its effectiveness.

An instructor's effectiveness and the outcome of an educational or training programme are certainly related, but we are dealing with two different concepts. When we measure teaching effectiveness we assess an instructor's teaching behaviour, while the evaluation of a programme assesses the degree of a participant's behaviour change. The assessment of the first is based on judgements made either by students, colleagues or superiors, while the assessment of the second is based on observations of job related behaviour by superiors or neutral observers.

This paper discusses the advantages and disadvantages of commonly used effectiveness measures and concentrates on the most recent development in this field, the Behaviourally Anchored Rating Scale (BARS). A new approach, the Behaviour Description Index (BDI), will be proposed which will reduce or avoid most of the shortcomings of past and current methods.



BEHAVIOURALLY ANCHORED RATING SCALES

The Critical Incident Method 

In his search for a useful tool to measure work performance, Flanagan (1954) developed a technique which he called the Critical Incident Method. He defined the critical requirements of a job as those behaviours which are crucial in making a difference between doing the job effectively and doing it ineffectively. Critical incidents, as the term implies, are simply reports by qualified observers of the things people did that were especially effective or ineffective in accomplishing parts of their jobs. Such incidents are actual behavioural accounts, recorded as stories or anecdotes and obtained from managers, job incumbents, and others close to the job being studied. In its simplest form the Critical Incident Method consists of listings of critical incidents which are then compared to exhibited behaviour of employees to be rated. The underlying objective of this procedure is to obtain a job-specific scale of behaviour effectiveness. A modified version of this approach is the Behaviourally Anchored Rating Scale(s) (BARS).

Development of Behaviourally Anchored Rating Scales

The first step in developing BARS is to ask a sample of job incumbents and/or their immediate supervisors and, where applicable, subordinates to write short descriptions of an incumbent's job behaviour they have either observed or heard about. Each behavioural description is characterized by the specification of a single job situation and the behaviour in response to that situation. To be critical, an incident must occur in a job situation where the purpose or intent of the behaviour is fairly clear and where its consequences are definite enough to leave little doubt concerning its effect on job performance (Flanagan, 1954). Although the latter definition suggests description of extreme behaviour, a critical incident need not refer exclusively to the extremes of performance.

After the behavioural incidents are gathered, the pool is usually edited to conform to an expected behaviour format, that is, each description of an incumbent's job behaviour is prefaced with the phrase "could be expected...." The intent of the phrase is to allow the rater to generalize from what he has seen the ratee do in a situation to what he would expect the ratee to do in a particular situation, regardless of the opportunity to actually observe the ratee.

After the editing and rephrasing, the incidents are categorized. Usually the researcher reads, sorts, and then labels groups of incidents in terms of similarity, or the researcher first qualitatively identifies a set of dimensions and then sorts statements according to their similarity in meaning to the a priori defined dimensions. To avoid criterion contamination through personal biases, an incident reallocation or "retranslation" procedure is used (Smith and Kendall, 1963). In the retranslation process experts (job-knowledgeable employees) are provided with a list of job dimension definitions and asked to assign the incidents to the behavioural dimensions they feel they describe. Criteria of retention are included in the procedure for determining the extent to which a particular incident is part of a dimension. Those incidents meeting the criteria are retained for subsequent use.
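
The retention step of the retranslation procedure can be illustrated with a short sketch. The paper does not state the retention criterion itself, so the rule used here (keep an incident when at least a given share of judges assign it to the same dimension) and the 0.6 threshold are assumptions made only for illustration; the function name `retranslate` is likewise hypothetical.

```python
from collections import Counter


def retranslate(assignments, min_agreement=0.6):
    """Keep incidents that judges consistently assign to one dimension.

    assignments maps an incident to the list of dimension names chosen by
    the judges. min_agreement is an assumed retention criterion: the share
    of judges who must agree on the most frequently chosen dimension.
    """
    retained = {}
    for incident, choices in assignments.items():
        # Most frequently chosen dimension and the number of judges choosing it.
        dimension, votes = Counter(choices).most_common(1)[0]
        if votes / len(choices) >= min_agreement:
            retained[incident] = dimension
    return retained


# Hypothetical judgements from four judges on two incidents.
example = {
    "uses the blackboard to illustrate a problem": ["in class", "in class", "in class", "grading"],
    "returns assignments late": ["grading", "in class", "availability", "in class"],
}
print(retranslate(example))
```

In this made-up example the first incident is retained (three of four judges agree) and the second is dropped (only two of four agree).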

Following the retranslation procedure the incidents are rated on a Likert-type scale (usually from 1 to 7, or 1 to 9) as to the degree of effectiveness they characterize in each job dimension.

The mean rating for an incident determines its scale value, while the Standard Deviation (SD) of the ratings is viewed as an index of ambiguity. The procedure requires that the ambiguous incidents (i.e., incidents with an SD in excess of some cut-off value) be excluded from the scales. The retained incidents are then ordered within the performance dimensions they anchor in terms of their mean scale values. The usual arrangement of anchors is a vertical graphic scale, consisting of a vertical line, marked in equal-appearing intervals, with incidents arranged along its length according to their mean value. Each scale is headed by a dimension definition and usually omits the scale value for each incident of behaviour. Exhibit 1 is an example of a scale developed by Das (1975) for instructors of a school of business administration.
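
The selection and ordering of anchors described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `build_anchors`, the use of the sample standard deviation, and the example ratings are hypothetical; the cut-off of 1.5 is one of the values reported as typical later in this paper.

```python
from statistics import mean, stdev


def build_anchors(ratings_by_incident, sd_cutoff=1.5):
    """Return (incident, scale_value) pairs ordered from high to low.

    ratings_by_incident maps an incident to the judges' effectiveness
    ratings (for example on a 1 to 7 scale). Incidents whose sample
    standard deviation exceeds sd_cutoff are treated as ambiguous and
    excluded; the mean rating of a retained incident is its scale value.
    """
    anchors = []
    for incident, ratings in ratings_by_incident.items():
        if stdev(ratings) > sd_cutoff:
            continue  # judges disagree too much about this incident
        anchors.append((incident, mean(ratings)))
    # Highest scale value first, as on a vertical graphic scale.
    return sorted(anchors, key=lambda pair: pair[1], reverse=True)


# Hypothetical ratings by four judges.
ratings = {
    "incident A": [6, 7, 6, 7],
    "incident B": [1, 7, 2, 6],
    "incident C": [2, 2, 3, 1],
}
for incident, value in build_anchors(ratings):
    print(incident, value)
```

With these made-up ratings, incident B is excluded because the judges disagree widely, and incidents A and C become anchors near the top and bottom of the scale.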



In using the scale, the rater is instructed first to read the dimension definition for a scale. Then, he is asked to read each incident starting at the bottom and reading toward the top until he reads an incident that exceeds the ratee's "typical" best job behaviour. He then returns to the highest "typical incident" and checks it as indicative of the ratee's job performance within that dimension of the job. The value of the incident checked by the rater determines the performance score on that dimension.



It is possible to score the scales in at least two ways. First, if a summation is desired, scores can be summed across dimensions for a ratee. Second, if a performance profile is desired, the score on each scale can be reported. Typically, the latter format is used.

Advantages and Distinguishing Features

Zedeck, et al. (1973, p. 1) list the following points as the advantages and distinguishing features of BARS procedures:






    1. Each scale employs job-related behavioural incidents as anchors or reference points;

    2. Groups with work experiences similar to those who eventually use (or are subjected to) the scales participate... (in their development)...;

    3. The terminology commonly used in the... (job to be rated)... is retained in the anchors;

    4. A (reallocation) procedure is used... to reduce the ambiguity of the scales...;

    5. Conceptually independent scales with high scale reliability are obtained; and

    6. In actual use, ratings... (can be) documented with specific incidents....

Smith and Kendall (1963) suggest that their retranslation procedure will lead to less leniency and central tendency errors. Cummings and Schwab (1973) point out that the use of critical incidents may prove to be useful in providing feedback to appraisers, since their specificity can serve as a concrete example of areas where job behaviours could be improved.



Inter-rater Reliability

Campbell, et al. (1973) and Zedeck and Baker (1972) assessed the inter-rater reliability of specific BARS instruments. In both studies, inter-rater reliability was low to moderate (i.e., r's ranged from '524 to .55). The results suggest that acceptable levels of inter-rater reliability have not yet been obtained using BARS procedures.






It may have been, however, that the tests of inter-rater reliability were deficient, in that coefficients were computed between levels of supervision rather than within. As Campbell, et al. (1973) suggest, perhaps the different levels of supervision had different opportunities to observe ratee behaviour or differed in perceiving the utility of specific behaviours for meeting job requirements. A similar view was expressed by Borman (1974). In his opinion, one should not expect high reliability under such circumstances, since raters at different levels may have different ratee performance dimensions which they feel are relevant. Borman tested his hypothesis by having secretaries and academic instructors each independently develop their own critical incidents for the job of secretary. Both groups identified different performance dimensions. BARS developed from the critical incidents were then used by the two groups to rate the secretaries on all performance dimensions. For each rater group, Borman found that inter-rater reliability was higher for the performance dimensions the group had identified than for the dimensions identified by the other group. This explanation of the low to moderate inter-rater reliability of BARS is plausible. However, there is a second possible explanation. It will be discussed under "Shortcomings of BARS."



Central Tendency

No studies on BARS so far seem to have paid much attention to the problem of central tendency. It describes the inclination of raters to avoid extreme ratings on a scale, e.g. "Outstanding" or "Very Poor." BARS should be less prone to this problem because of their behaviour specificity. Since often independent behaviour samples are utilized, a rater has to make a choice; i.e. he is able to avoid extreme ratings. (See chapter on "Shortcomings of BARS".)



Research so far has shown that BARS seem to have a slight advantage over other methods of performance evaluation. However, the question has to be asked whether these findings illustrate a genuine lack of superiority or whether the methodology used in the comparison studies is less than adequate. As mentioned above, Schwab, et al. (1975) criticize the use of only two instruments for comparison purposes. Another problem may lie in the standard deviation criterion used to select the critical incidents. In most cases the criterion was set so that it indicated a substantial amount of disagreement among judges as to the level of effectiveness of the behaviour described, typically 1.50, 1.75 or even 2.0 standard deviations. This may suggest that the critical incidents selected for the instrument were not the unambiguous behaviour samples the creators had hoped for. A third questionable area may be calculation

nature of the group being evaluated.

Halo Effect

The halo effect appears in evaluation when the evaluator tends to assign the same rating or level to each factor being rated (Glueck, 1974).



Burnaska and Hollman (1974) found that BARS resulted in less halo than a numerically anchored and an adjective scale. However, they point out that all three scales had excessive levels of halo. In a study comparing BARS and a numerically anchored scale, Campbell, et al. (1973) found that the former scale format showed less halo than the latter. Similar results were reported by Groner (1974), Borman & Dunnette (1975), and Keaveny and McGaun (1975). On the other hand, Borman and Vallon (1974) found no differences in halo effect between BARS and other non-behavioural scales. Similar to the critique on the approach to measure leniency, Schwab, et al. (1975) argue against the use of only two instruments to study the halo effect:

    "If one begins with the reasonable assumption that performance on various dimensions is inter-related, then comparison of the intercorrelations generated by just two instruments provides little basis for deciding the actual or true inter-relations in the group appraised." (p. 560)






Supervisors tend to overrate their subordinates; i.e. the tendency is to be lenient rather than strict. The result of this "error" is that the average performance rating is not at the midpoint of a scale but on the positive side of it.

Smith and Kendall (1963) suggest that BARS should be less susceptible to leniency effects because of the unambiguous dimensions and anchors developed by the procedure. However, research results are equivocal on this issue. Campbell, et al. (1973) found that the mean ratings on their instrument were, on average, closer to the midpoints of the scales than those of a summated rating scale. On the other hand, Borman and Vallon (1974) found that a group of employees had significantly higher ratings on a behaviourally anchored scale. Campbell, et al. interpreted their results to mean that BARS demonstrated less leniency, while Borman and Vallon concluded that BARS showed greater leniency error. Schwab, et al. (1975) point out that it may be risky to make inferences about relative leniency effects using only two instruments in a comparative study because "It is not possible to determine what the true average rating should have been" (p. 559). For this reason, Schwab, et al. (1975) recommend using more than two sets of measures in the evaluation process. A greater number of measures would allow more comparisons

raters consisting of employees from different levels; e.g. superiors and subordinates of ratees. If the above-mentioned problems are corrected, BARS may demonstrate a more significant advantage over other measures.



Shortcomings of Behaviourally Anchored Rating Scales

In the introduction, several references have been cited which suggest that BARS have a number of advantages over traditional performance rating methods. The advantages claimed are: higher job specificity, higher motivation of raters, higher acceptance of ratings by ratees, higher dimension independence, less halo, less leniency, and less central tendency. However, there seem to be at least two shortcomings of the BARS technique which have not been discussed in the previous literature review:

    1. waste of valuable information;

    2. multidimensionality.

Waste of Valuable Information

After development, critical incidents for a job are put through the validation and retranslation process, and they must fulfil the standard deviation criterion. Usually 20 to 50 critical incidents per job dimension survive. Yet only between 5 and 10, depending on the number of anchoring points of the scale, are utilized; all others are thrown out. Undoubtedly, those items which are not used contain valuable information about the job dimension to which they were attributed in the retranslation process. The decision to

criteria: a convenient mean value to fit the scale points, and the degree of agreement between raters as measured by the standard deviation.



Multidimensionality

A second problem with BARS has to do with the use of independent critical incidents in behaviour dimensions. Behaviour dimensions are important aspects of a total job behaviour domain which, in turn, is composed of all possible relevant job behaviours. "This instructor always uses the blackboard to illustrate a problem" is a sample of the behaviour dimension "instructor in class" of an instructor's job behaviour domain. Wallace and Schwab (1973) found five behaviour dimensions for an instructor in a school of business (see Exhibit 2), while Das (1975) identified 18 dimensions (see Exhibit 3). When critical incidents are generated, the intention is to sample to a significant degree the behaviour domain of a job dimension. The problem is that there are often so many job dimensions that it is impractical to develop scales for each one, since a rater very likely will refuse to evaluate a ratee on 30 or 40 dimensions. For this reason, job dimensions usually are collapsed into a manageable number, e.g. 5 or 10. As a consequence of this approach most scales utilizing critical incidents use independent behaviour samples, thus forcing the rater to make a difficult choice and opening up the rating procedure to possible biases, like leniency and halo.

To illustrate the problem, an example is taken from Das (1975). He identified 18 job behaviour dimensions of an instructor. A BARS for one of the dimensions is shown in Exhibit 4. A comparison of the seven behaviour samples reveals that behaviour #1 is conceivably independent of the behaviours #2, 3, and 4. In other words, it is possible that a rater can choose all these behaviours as "typical" and not just one. On the other hand, behaviour samples #1, 5, 6 and 7 are mutually exclusive (dependent). Ideally, a behaviour dimension consists only of mutually exclusive or unidimensional behaviours. Otherwise the rater has to choose between different possible behaviours, which leaves the instrument open to response biases and will result in low reliability.

Multidimensionality very likely is also one of the causes of the central tendency effect BARS seem to exhibit, although to a lesser degree than other common scales (Campbell, et al., 1973). Since a rater has independent choices, it is possible for him to avoid extreme ratings. The same characteristic may also be the cause of the halo error.






A NEW PERFORMANCE EVALUATION SCALE:
THE BEHAVIOUR DESCRIPTION INDEX



Characteristics of the New Scale 

Instead of the usual utilization of only 5 to 9 behaviour descriptions in a BARS, it is proposed that a larger sample of the total behaviour domain of a job dimension be used (see Exhibit 5). The number of critical incidents could be determined by the total number of critical incidents generated per job dimension. The number of incidents utilized per scale will be limited only by fatigue effects of raters. It is expected that for practical purposes the maximum number of behaviour descriptions will be equal to or less than 20. The utilization of a larger number of critical incidents would overcome or at least reduce one of the major shortcomings of BARS discussed before: the loss of information. It is conceivable that in many instances 20 critical incidents will encompass the total behaviour domain of a job dimension. If that is not the case, then at least a much better sampling can be done. If the total domain were 40, then 20 items represent 50% as compared to 17.5% with a sample of 7.
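
The sampling comparison above is simple arithmetic, and a small sketch makes it easy to repeat for other domain sizes. The function name `domain_coverage` is hypothetical; the figures are the ones used in the text.

```python
def domain_coverage(items_used, domain_size):
    """Percentage of a job dimension's behaviour domain sampled by a scale."""
    if domain_size <= 0 or not 0 <= items_used <= domain_size:
        raise ValueError("items_used must lie between 0 and domain_size")
    return 100.0 * items_used / domain_size


# Figures from the text: a domain of 40 incidents.
print(domain_coverage(20, 40))  # BDI-style scale with 20 items
print(domain_coverage(7, 40))   # BARS-style scale with 7 anchors
```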

In order to avoid the disadvantages of graphic rating scales and BARS, a forced choice scoring is suggested. Raters will respond to the question: Does the ratee exhibit the below described behaviour, yes or no? In case the rater is not sure, he can respond with a question mark. After the response sheet is completed, all ratings will be converted into points. (For details see chapter "The Rating Procedure for the BDI".) This conversion could be done by a different person than the rater or through a computer program. This approach should have a drastic influence on the halo effect. Since the rater does not know whether he evaluated the ratee high or low on a scale, it is very difficult or impossible for him to transfer a general characteristic from one scale to another. It could be argued that the rater will know how he evaluates a ratee if he responds only positively to critical incidents. However, since positive and negative critical incidents will be randomly mixed in a scale, the rater must really concentrate on his responses if he wants to bias the evaluation. He still does not know the actual score (see Rating Procedure for the BDI).

Another possible advantage of the new scale would be a reduced or eliminated leniency effect, largely because of the same reasons described above: mixing of a larger number of positive and negative statements and independent scoring. Only a consciously false response could induce a leniency effect. But no scale is immune against wilful misuse.






A fourth improvement as compared to traditional BARS would be the virtual elimination of the central tendency effect. Since the BDI does not use continuous scales with extreme ratings on either end, the cause of any possible central tendency effect is removed.

There may be a fifth advantage of the BDI over BARS. It has been suggested that the low to moderate inter-rater reliability of BARS may be caused by using raters from different organizational levels (e.g., Campbell, et al., 1973). Nobody so far has pointed to the possibility that the multidimensionality problem may be either a contributing or even a major factor in causing the low reliability. It is quite possible that the BDI will reduce the problem since the rater is not forced to choose one from several possible items, but he may check off as many items as are available.

A superior could determine in advance what scores would be acceptable or unacceptable, e.g. out of 60 possible points (20 items x 3 points):

     0 - 30 may mean: urgent training required
                      (or, if measured after training: training ineffective)

    31 - 40 may mean: training recommended

    41 - 50 may mean: refresher course may be useful

    51 - 60 may mean: no training required.
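
The example cut-offs above amount to a simple lookup from a total score to a recommendation. The sketch below encodes exactly the bands given in the text for a 20-item scale; the function name `interpret_total` is hypothetical, and in practice a superior would set the bands in advance for each scale.

```python
def interpret_total(points):
    """Map a BDI scale total (0 to 60 points, 20 items x 3 points) to the
    example interpretation bands given in the text."""
    if not 0 <= points <= 60:
        raise ValueError("total must lie between 0 and 60")
    if points <= 30:
        # If measured after training, the text reads this band as
        # "training ineffective".
        return "urgent training required"
    if points <= 40:
        return "training recommended"
    if points <= 50:
        return "refresher course may be useful"
    return "no training required"


print(interpret_total(44))
```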



In summary, the new BDI scale seems to offer the following advantages:

    1. increased information content by improving sampling of the behaviour domain;

    2. reduced or no halo effect;

    3. reduced or no leniency effect;

    4. no central tendency effect;

    5. higher inter-rater reliability.



Rating Procedure for the BDI²

The BDI uses positive and negative critical incidents in random order. The number of positive and negative statements does not influence the score because of the scoring characteristics. If a rater responds positively (Yes) to a positively worded statement, or negatively (No) to a negatively worded statement, the score will be 3 points. A positive response to a negative statement and vice versa results in 0 points. If the respondent is not sure or cannot decide, the response is a question mark (?) and the score will be 1 point. Again, it should be emphasized that the conversion of the ratings (Yes, No, ?) to point scores (3, 0, 1) is probably not done by the rater, but by a second person, or more likely by a computer, especially if the number of ratees is large. It is assumed, pending empirical investigation, that the job dimensions are relatively independent. For this reason the point scores will be totalled for each scale separately.
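
Since the conversion of ratings to points is meant to be done by a second person or a computer, a minimal sketch of that conversion may be useful. It follows the 3/0/1 rule stated above and totals one scale at a time; the names `score_item` and `score_scale`, the response codes "yes", "no" and "?", and the sample data are assumptions for illustration only.

```python
def score_item(statement_is_positive, response):
    """Convert one rating to points using the 3/0/1 rule.

    statement_is_positive: True for a positively worded incident,
    False for a negatively worded one.
    response: "yes", "no", or "?" (rater not sure).
    """
    if response == "?":
        return 1
    if response not in ("yes", "no"):
        raise ValueError("response must be 'yes', 'no' or '?'")
    # Yes to a positive statement, or No to a negative one, earns 3 points;
    # the opposite combinations earn 0.
    return 3 if (response == "yes") == statement_is_positive else 0


def score_scale(statement_polarities, responses):
    """Total the points for one scale (one job dimension)."""
    if len(statement_polarities) != len(responses):
        raise ValueError("one response is needed per statement")
    return sum(score_item(p, r) for p, r in zip(statement_polarities, responses))


# Hypothetical four-item scale with mixed positive and negative statements.
polarities = [True, False, True, False]
responses = ["yes", "no", "?", "yes"]
print(score_scale(polarities, responses))
```

Keeping the polarity key separate from the response sheet mirrors the point made in the text: the rater records only Yes, No or ?, and does not see the resulting score.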



CONCLUSION AND IMPLICATIONS

A new approach to teaching and training evaluation has been discussed. The characteristics of the new instrument, the Behaviour Description Index, seem to be superior to those of conventional performance appraisal instruments, e.g. summated or graphic rating scales and BARS.

If the instrument can be empirically validated (there is little doubt that it will be), it should prove to be a significant improvement in the evaluation process. Instead of relying on vague trait characteristics which mean different things to different people, very specific behaviours are described, which can easily be observed.

Secondly, a large part of the total behaviour domain can be utilized, enabling raters to pinpoint shortcomings of instructors or training participants, thus making it easier to take corrective actions, e.g. retraining or counselling.

There are other possible uses of the BDI. Much has been written about the vagueness of job descriptions. What could be a better solution to this problem than handing a new job incumbent, together with the description of his responsibilities, a copy of a BDI of his job? He would find samples of effective and ineffective job behaviour and would know immediately what is expected of him. Other areas of application could be performance appraisal and the determination of training needs. Actually, the BDI could be the basis for a new systems approach in the personnel management area. Future research will show whether this is possible. The first indications are certainly encouraging.



FOOTNOTES

1. For a detailed discussion of these problems, see Schwind, 1975, a and b.

2. This is a similar approach as described in Smith, Kendall and Hulin, 1969.



References ■ ' , 

ifiprmany WjC. , "The ratings of individuals in organizations: ' . S- 
An alternative approach." Organizational Behavior and 
Human Performance^ 12 (1S74) 105 - 124.^ ■ • - : 

Borman, W.C., and Dunnette, M.D., "Behavior based versus
 trait oriented performance ratings: An empirical study."
 Journal of Applied Psychology, 60 (1975) 561 - 565.

Borman, W.C., and Vallon, W.R., "A view of what can happen
 when behavioral expectation scales are developed in one
 setting and used in another." Journal of Applied
 Psychology, 59 (1974) 197 - 201.

Burnaska, R.F., and Hollman, T.D., "An empirical comparison
 of the relative effects of rater response biases on
 three rating scale formats." Journal of Applied
 Psychology, 59 (1974) 307 - 312.

Campbell, J.P., and Dunnette, M.D., and Arvey, R.D., and
 Hellervik, L.V., "The development and evaluation of
 behaviorally based rating scales." Journal of Applied
 Psychology, 57 (1973) 15 - 22.

Catalano, R.F., and Kirkpatrick, D.L., "Evaluating Training
 Programs." Training and Development Journal, May 1968.

Cummings, L.L., and Schwab, D.P., Performance in Organizations:
 Determinants and Appraisal, Glenview, Ill.: Scott, Foresman
 and Co., 1973.

Das, H., "Behaviorally Anchored Rating Scales in Teaching
 Evaluation", Unpublished Master's Thesis, University of
 British Columbia, Faculty of Commerce, 1975.

Flanagan, J.C., "The critical incident technique." Psychological
 Bulletin (1954) 327 - 358.

Glueck, W.F., Personnel: A Diagnostic Approach, Dallas, Texas:
 Business Publications, 1974.

Groner, D.M., "Reliability and susceptibility to bias of
 behavioral and graphic rating scales." Unpublished
 doctoral dissertation, University of Minnesota, 1974.

Harari, O., and Zedeck, S., "Development of Behaviorally
 Anchored Rating Scales for the Evaluation of Faculty
 Teaching." Journal of Applied Psychology, 58 (1973)
 261 - 265.




Keaveny, T.J., and McGann, F., "A comparison of behavioral
 expectation scales and graphic rating scales." Journal
 of Applied Psychology, 60 (1975) 695 - 703.

McGehee, W., and Thayer, P.W., "Training in Business and
 Industry." John Wiley & Sons, Inc., N.Y., 1961.


Schwab, D.P., and Heneman, H.G., and DeCotiis, T.A., "Behaviorally
 anchored rating scales: A review of the literature."
 Academy of Management Proceedings (1975) 222 - 224.

Schwind, H.F., "Thoughts on Training Evaluation." Canadian
 Training Methods, 8 (1975) #1, 14 - 15.

Smith, P.C., and Kendall, L.M., "Retranslation of expectations:
 An approach to the construction of unambiguous anchors
 for rating scales." Journal of Applied Psychology, 47
 (1963) 149 - 155.

Smith, P.C., Kendall, L.M., and Hulin, C.L., "The Measurement
 of Satisfaction in Work and Retirement." Rand McNally
 & Co., Chicago, Ill., 1969.

Wallace, M.O., and Schwab, D.P., "The Validation of
 Teaching-Effectiveness Measure in Two Business Schools." Academy
 of Management Proceedings, 1973.

Zedeck, S., and Baker, T., "Nursing performance as measured
 by behavioral expectation scales: A multitrait-multirater
 analysis." Organizational Behavior and Human Performance,
 7 (1972) 457 - 466.

Zedeck, S., and Imperato, N., and Krausz, M., and Oleno, T.,
 "Development of behaviorally anchored rating scales as
 a function of organizational level." Journal of Applied
 Psychology, 59 (1974) 249 - 252.



Exhibit 1 


Job Dimensions of Department Manager 

1. Supervising sales personnel

2. Handling customer complaints and making adjustments

3. Meeting day-to-day deadlines

4. Merchandise ordering

5. Developing and planning special promotions

6. Assessing sales trends and acting to maintain merchandising
 position

7. Using company systems and following through on administrative
 operations

8. Communicating relevant information to associates and to
 higher management

9. Diagnosing and alleviating special department problems



From Campbell, J.P., Dunnette, M.D., Arvey, R.D., and Hellervik,
L.V., "The Development and Evaluation of Behaviorally Based
Rating Scales." Journal of Applied Psychology, Vol. 57, No. 1
(1973) 15-22.



EXHIBIT 2

DIMENSIONS OF TEACHER BEHAVIOUR IDENTIFIED
FROM INCIDENTS GIVEN BY STUDENTS




1. Instructor in class

2. Required reading

3. Subject matter

4. Instructor in general

5. Assignments and examinations



EXHIBIT 3 



DIMENSIONS (CATEGORIES) OF TEACHER BEHAVIOUR 
IDENTIFIED FROM INCIDENTS GIVEN BY STUDENTS 



1. Course Outlining and Structuring

2. Administrative Handling

3. Coverage Of Material

4. Teaching Style

5. Teaching Methods

6. Evaluation

7. Interaction Outside Class

8. Flexibility and Responsiveness



DIMENSIONS (CATEGORIES) OF TEACHER BEHAVIOUR
IDENTIFIED FROM INCIDENTS GIVEN BY PROFESSORS



9. Interaction With Colleagues

10. Interaction With Students Outside Class

11. Behaviour in The Class Room

12. Research Activities

13. Handling Administrative Matters



DIMENSIONS (CATEGORIES) OF TEACHER BEHAVIOUR
IDENTIFIED FROM INCIDENTS GIVEN BY ADMINISTRATIVE STAFF

14. Facilitation Of Administrative Work Flow

15. Controlling Expenditures

16. Adherence To Policies

17. Providing Feedback To The Staff

18. Counselling Activities







EXHIBIT 4 


BEHAVIOURALLY ANCHORED RATING SCALE 




Could be expected to give course 
outlines and schedules to students 



Could be expected to set specific
targets for each session.



Could be expected to set class
participation as an evaluation
criterion without clearly stating his
expectations.



Could be expected to specifically state
requirements for the course and use
definite and stated criteria for
evaluation.



Could be expected to hold an introductory
session in which the students'
expectations are ascertained and
instructor's course objectives are made
clear.



Could be expected to take quite a few
days to tell the students what is the
course content of the course.



Could be expected to announce mid-way 
through that there would be a final 
exam, in contradiction to his earlier 






